Statistical analysis of water discharge of surface streams

Task:

  • find a dataset about a river
  • analyse the various time series by fitting a suitable distribution to each.

Statistical distributions

Fermi-Dirac statistics, Bose-Einstein statistics, Maxwell-Boltzmann distribution, Planck’s law

Binomial distribution

  • number of successes of n independent experiments

Cauchy distribution

  • ratio of normally distributed variables
  • shape of spectral lines

Gamma distribution

Geometric distribution

  • number of trials before the first success in Bernoulli trials

Chi-squared distribution

  • sum of squares of independent standard normal variables
  • used to construct confidence intervals

Exponential distribution

Gumbel distribution

  • expectation value of the maximum/minimum of samples drawn from a parent distribution with an exponential tail

Lévy distribution

  • heavy-tailed distribution

Logistic distribution

  • categorical dependent variables
  • logistic regression

Log normal distribution

  • product of many independent variables
  • size distribution of particles

Normal distribution

Pareto distribution

  • heavy-tailed, decreasing power-law
  • degree distribution of social networks

Poisson distribution

Student t

  • average of a normally distributed population when n is small

Uniform distribution

Weibull distribution

  • any of n iid variables being above a certain limit
  • first failure of a system of n identical components
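Most of the distributions above are available in scipy.stats with a uniform interface. A minimal sketch of the fit-and-evaluate pattern used throughout this notebook, on synthetic data rather than the river dataset:

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(42)
# Synthetic positive-valued "discharge" sample drawn from a known log-normal
sample = rng.lognormal(mean=2.0, sigma=0.5, size=10_000)

# Every scipy.stats continuous distribution exposes .fit() (maximum likelihood);
# fixing loc=0 keeps the parametrization comparable to the textbook form
shape, loc, scale = lognorm.fit(sample, floc=0)
rv = lognorm(shape, loc, scale)  # frozen distribution: rv.pdf, rv.cdf, rv.ppf, ...

# The fitted shape parameter estimates the true sigma = 0.5
print(abs(shape - 0.5) < 0.05)  # → True
```

The same `dist.fit(...)` / `dist(*params)` pattern works for gumbel_r, pareto, expon, and the other candidates tried below.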
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import plotly.graph_objects as go

import pandas as pd
import seaborn as sns
from copy import deepcopy

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

from scipy.stats import gumbel_r, gompertz, maxwell, pareto, levy, lognorm, weibull_max, expon, poisson
In [2]:
matplotlib.rcParams["figure.figsize"] = (10, 6)
matplotlib.rcParams["axes.titlesize"] = 16
matplotlib.rcParams["axes.labelsize"] = 13

matplotlib.rcParams["xtick.labelsize"] = 16
matplotlib.rcParams["ytick.labelsize"] = 16
matplotlib.rcParams["axes.grid"] = True
matplotlib.rcParams["xtick.major.size"] = 12
matplotlib.rcParams["ytick.major.size"] = 12
matplotlib.rcParams["xtick.major.width"] = 2
matplotlib.rcParams["ytick.major.width"] = 2

Exercise 1

For the purposes of the current exercises, select a stream from a rainy area with relatively small discharge so that the effect of short but strong storms is visible. Good choices are small rivers from the north-eastern US, e.g. site 01589440. Retrieve at least 10 years of data.

Large amounts of historical surface water data are available from the United States Geological Survey (USGS) at https://waterdata.usgs.gov/nwis. The goal of the project is to retrieve samples from the web interface manually, and then automate the process by calling the web service as described at https://help.waterdata.usgs.gov/faq/automated-retrievals. You can also use the dataset from the /home/course/Datasets/02-data/ folder.

or

Find a database of historic water level measurement of the Danube at Budapest

Manual data retrieval

THE FIRST TWO CELLS RUN FOR ~5 MINUTES! It is not necessary to run them.

In [5]:
my_url = "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_00060=on&cb_00065=on&format=html&site_no=03014500&period=&begin_date=2010-02-08&end_date=2021-02-15"


#Downloading the webpage and storing its HTML right away.
resp_obj = urlopen(Request(my_url, headers={'User-Agent': 'Mozilla/6.0'}))
page_html = resp_obj.read()
resp_obj.close()    
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
(long traceback omitted: the slow download was interrupted by hand)
In [ ]:
#Parsing the website's HTML with bs4. For some reason, I couldn't reach the body.table part in the usual way,
#so I went with a brute-force method. Of course, if the table consisted of more columns with
#other types of data, this code wouldn't work right away.


parsed_text = BeautifulSoup(page_html,"html.parser")
rows = [x for x in parsed_text.body.contents[11].contents[4].children]
rows = list(filter(lambda x: x != "\n", rows)) #filtering out newline characters

data = {} #Storing every row of the table in a key-value pair (dict)

for i in range(len(rows)):
    row = rows[i]
    cells = row.findAll("td", {"nowrap":"nowrap"}) #The three data-bearing cells of the row
    datetime_timezone = cells[0].string

    discharge_value = cells[1].span.string
    discharge_comment = cells[1].sup.string

    gage_height_value = cells[2].span.string
    gage_height_comment = cells[2].sup.string
    data[f"row_{i}"] = [datetime_timezone, discharge_value, discharge_comment, gage_height_value, gage_height_comment]
In [ ]:
data_df = pd.DataFrame.from_dict(data, orient="index", columns = ["datetime_tz", "discharge_v", "discharge_c", "gage_h", "gage_c"])
In [ ]:
data_df.head()

Automated retrieval

In [ ]:
!curl "https://nwis.waterdata.usgs.gov/nwis/uv/?parameterCd=00060,00065&format=rdb&site_no=01321000&period=&begin_date=20100101&end_date=20191231&siteStatus=all" > data.csv
In [ ]:
!head -n 40 data.csv

Exercise 2

Load the downloaded data file into the processing environment, paying attention to handling time stamps and performing the necessary data type conversions. Converting dates to floating point numbers such as a Unix timestamp or Julian date usually makes handling time series easier. Plot the data for a certain interval to show that the effect of storms is clearly visible.
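A sketch of the timestamp-to-float conversion mentioned above; the sample strings mimic the USGS datetime format, they are not the downloaded data:

```python
import pandas as pd

# Two hypothetical readings in the same format as the USGS "datetime" column
ts = pd.to_datetime(pd.Series(["2010-01-01 00:00", "2010-01-01 00:15"]))

# Unix timestamp in seconds: nanoseconds since 1970-01-01 divided by 1e9
# (naive timestamps are treated as UTC here; the real data is in EST)
unix_s = ts.astype("int64") / 1e9
print(unix_s.tolist())  # → [1262304000.0, 1262304900.0]
```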

In [3]:
df = pd.read_csv("data.csv",sep='\t', comment='#',header=0)
df = df.drop(index=0)
df.head()
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3166: DtypeWarning:

Columns (1,4,6) have mixed types.Specify dtype option on import or set low_memory=False.

Out[3]:
agency_cd site_no datetime tz_cd 106704_00065 106704_00065_cd 106705_00060 106705_00060_cd
1 USGS 01321000 2010-01-01 00:00 EST 2.94 A 1030 A
2 USGS 01321000 2010-01-01 00:15 EST 2.94 A 1030 A
3 USGS 01321000 2010-01-01 00:30 EST 2.94 A 1030 A
4 USGS 01321000 2010-01-01 00:45 EST 2.94 A 1030 A
5 USGS 01321000 2010-01-01 01:00 EST 2.94 A 1030 A
In [4]:
print(df.agency_cd.unique(), df.site_no.unique(), df.tz_cd.unique())
#These columns are irrelevant, because each contains a single repeated value
['USGS'] ['01321000' 1321000] ['EST']
In [5]:
df = df.drop(columns = ["agency_cd", "site_no","tz_cd"])
df.head()
Out[5]:
datetime 106704_00065 106704_00065_cd 106705_00060 106705_00060_cd
1 2010-01-01 00:00 2.94 A 1030 A
2 2010-01-01 00:15 2.94 A 1030 A
3 2010-01-01 00:30 2.94 A 1030 A
4 2010-01-01 00:45 2.94 A 1030 A
5 2010-01-01 01:00 2.94 A 1030 A
In [6]:
df.dtypes #The columns have to be cast to the appropriate data types
Out[6]:
datetime           object
106704_00065       object
106704_00065_cd    object
106705_00060       object
106705_00060_cd    object
dtype: object

From the header of the .csv data file we know what the codes mean:

  • 106704 00065: Gage height, feet
  • 106705 00060: Discharge, cubic feet per second

In [7]:
df = df.rename(columns = {"106704_00065": "gage", "106704_00065_cd": "gage_c", "106705_00060": "discharge", "106705_00060_cd":"discharge_c"})
In [8]:
df["datetime"] = pd.to_datetime(df.datetime)
df["discharge"] = df.discharge.astype(float)
df["gage"] = df.gage.astype(float)
df.head(), df.dtypes
Out[8]:
(             datetime  gage gage_c  discharge discharge_c
 1 2010-01-01 00:00:00  2.94      A     1030.0           A
 2 2010-01-01 00:15:00  2.94      A     1030.0           A
 3 2010-01-01 00:30:00  2.94      A     1030.0           A
 4 2010-01-01 00:45:00  2.94      A     1030.0           A
 5 2010-01-01 01:00:00  2.94      A     1030.0           A,
 datetime       datetime64[ns]
 gage                  float64
 gage_c                 object
 discharge             float64
 discharge_c            object
 dtype: object)
In [9]:
#We can see that there are some NaN values; the ratio of NaNs in each column is the following:
df.isna().sum()/len(df)
Out[9]:
datetime       0.000000
gage           0.000011
gage_c         0.000011
discharge      0.154436
discharge_c    0.154436
dtype: float64
In [10]:
df = df[df.discharge.notna()] #Getting rid of the rows where discharge is missing
In [11]:
df.isna().sum()/len(df)
Out[11]:
datetime       0.000000
gage           0.000014
gage_c         0.000014
discharge      0.000000
discharge_c    0.000000
dtype: float64

Exercise 3

Plot the histogram of water discharge values. Fit the data with an appropriate distribution function and bring arguments in favor of the choice of function.

In [12]:
def fitPlotDist(dist, to_fit, x, alpha = 1.0):
    #Fit the given scipy.stats distribution to the data and plot its PDF over x
    params = dist.fit(to_fit)
    rv = dist(*params)
    plt.plot(x, rv.pdf(x), label=dist.name, alpha = alpha)
In [13]:
sns.distplot(df.discharge,bins=2000, label="raw data with KDE", kde = True)
plt.ylabel("frequency")
x = np.linspace(min(df.discharge), max(df.discharge), 1000)
#fitPlotDist(gumbel_r,df.discharge,x)
#fitPlotDist(pareto, df.discharge, x)
#fitPlotDist(maxwell, df.discharge,x)
fitPlotDist(lognorm,df.discharge,x)
fitPlotDist(levy, df.discharge, x)
fitPlotDist(expon,df.discharge,x)
plt.xlim(0,5000)

plt.legend()
plt.title("Histogram of water discharge values")
/home/dvtulf/.local/lib/python3.7/site-packages/seaborn/distributions.py:2557: FutureWarning:

`distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).

/opt/conda/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:2494: RuntimeWarning:

invalid value encountered in double_scalars

Out[13]:
Text(0.5, 1.0, 'Histogram of water discharge values')

The plot above shows that of the three distributions I fitted, the Lévy distribution fits best. However, I didn't find anything about the Lévy distribution on Wikipedia that would support this choice.
I suspect that the log-normal distribution would be better founded, because, as Wikipedia states:

a log-normal process is the statistical realization of the multiplicative product of many independent random variables.

This is true for discharge values: they are affected by many variables, be they meteorological or geological.
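This multiplicative argument can be checked on synthetic data (a sketch, not the discharge series itself):

```python
import numpy as np
from scipy.stats import lognorm, kstest

rng = np.random.default_rng(0)

# Each sample is the product of many independent positive factors
n_factors, n_samples = 200, 5000
products = rng.uniform(0.9, 1.1, size=(n_samples, n_factors)).prod(axis=1)

# log(product) = sum of logs, which is approximately normal by the CLT,
# so the product itself should be approximately log-normal
shape, loc, scale = lognorm.fit(products, floc=0)
stat, pvalue = kstest(products, lognorm(shape, loc, scale).cdf)
print(pvalue > 0.01)  # → True: the log-normal fit is not rejected
```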

Exercise 4

In case of small streams, storms and passing weather fronts with rain can significantly increase water discharge on a short time scale. Develop a simple algorithm to detect rainy events in the data. Plot the original time series and mark rainy events.

In [14]:
fig, ax1 = plt.subplots(figsize=[15,5])

ax1.plot(df.datetime,df.discharge, color="tab:orange", label="Discharge")
ax1.set_ylabel("Discharge [$ft^3$/s]",size=13)
plt.grid(ls = "dashed", color="tab:orange")
plt.legend(loc = 2,fontsize=12)


ax2 = ax1.twinx()
ax2.plot(df.datetime, df.gage, color="tab:blue",alpha = 0.6, label="Gage height")
ax2.set_ylabel("Gage height [ft]", size=13)
plt.grid(ls = "dotted",alpha = 0.5,color="tab:blue")
plt.legend(loc = 1, fontsize=12)

plt.title("Discharge and gage height plotted against time")
Out[14]:
Text(0.5, 1.0, 'Discharge and gage height plotted against time')
In [15]:
from scipy.stats import pearsonr
pearsonr(df[df.isna().sum(axis=1) < 1].discharge, df[df.isna().sum(axis=1) < 1].gage)
Out[15]:
(0.9266836620002035, 0.0)

From this graph we can conclude that the discharge at a given cross section depends linearly on the gage height, which seems intuitive. This claim is also supported by the strong positive correlation score.

In [16]:
'''fig, ax1= plt.subplots(figsize=[15,5])

ax1.plot(df.datetime,df.discharge.diff(), color="tab:orange", label="Change in discharge")
ax1.set_ylabel("Change in discharge [$ft^3$/s]",size=13)
plt.grid(ls = "dashed", color="tab:orange")
plt.legend(loc = 2)


ax2 = ax1.twinx()
ax2.plot(df.datetime, df.gage.diff(), color="tab:blue",alpha = 0.6, label="Change in gage height")
ax2.set_ylabel("Change in gage height [ft]", size=13)
plt.grid(ls = "dotted",alpha = 0.5,color="tab:blue")

plt.legend(loc = 1)
plt.title("Change in discharge and in gage height plotted against time",size=16)'''
Out[16]:
'fig, ax1= plt.subplots(figsize=[15,5])\n\nax1.plot(df.datetime,df.discharge.diff(), color="tab:orange", label="Change in discharge")\nax1.set_ylabel("Change in discharge [$ft^3$/s]",size=13)\nplt.grid(ls = "dashed", color="tab:orange")\nplt.legend(loc = 2)\n\n\nax2 = ax1.twinx()\nax2.plot(df.datetime, df.gage.diff(), color="tab:blue",alpha = 0.6, label="Change in gage height")\nax2.set_ylabel("Change in gage height [ft]", size=13)\nplt.grid(ls = "dotted",alpha = 0.5,color="tab:blue")\n\nplt.legend(loc = 1)\nplt.title("Change in discharge and in gage height plotted against time",size=16)'
In [17]:
from scipy.signal import argrelextrema, find_peaks

plt.figure(figsize=[15,5])

plt.plot(df.datetime,df.discharge, color="tab:orange", label="Discharge")

#p = argrelextrema(df.discharge.to_numpy(), np.greater, order = 5)[0]
p,_ = find_peaks(df.discharge,threshold=None, height = 2000, prominence = 700)
plt.plot(df.datetime.to_numpy()[p],df.discharge.to_numpy()[p], "o", label="Possible rain",color="tab:green")
plt.ylabel("Discharge [$ft^3$ /s ]",size=13)
plt.legend()
#plt.ylim(0,3000)
plt.title("Possible rains based on sudden increase in discharge",size=16)
len(p)
Out[17]:
221
/opt/conda/lib/python3.7/site-packages/IPython/core/pylabtools.py:132: UserWarning:

Creating legend with loc="best" can be slow with large amounts of data.

In [18]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=df.datetime, 
                         y=df.discharge,
                         name = "Discharge",
                         line={ "color" : "#FF7F0E", "width": 2}))

p,_ = find_peaks(df.discharge,threshold=None, height = 500, prominence = 500)

#Plot only the points, where peaks were found
fig.add_trace(go.Scatter(x = pd.to_datetime(df.datetime.to_numpy()[p]),
                         y = df.discharge.to_numpy()[p],
                         mode='markers', 
                         name='Possible rain',
                         marker = { "color" : "#2CA02C"}))

fig.update_layout(yaxis_title="Discharge [ft^3 /s ]")
#fig.update_xaxes(range=[pd.to_datetime("2014-01-01 00:00:00"), pd.to_datetime("2015-12-31 23:45:00")])
#This was my test interval
fig.show()
In [19]:
len(p)
Out[19]:
388

Exercise 5

Water discharge increases significantly during rain producing maxima in the time series. Plot the distribution of maximum values and fit with an appropriate function. Bring arguments to support the choice of probabilistic model.

In [20]:
plt.figure(figsize=[8,6])
sns.distplot(df.discharge.diff().to_numpy()[p],label="raw data with KDE", bins = 20, hist_kws = {"range" : [0,1000]})
plt.xlabel("Discharge [$ft^3$/s]",size=13)
fitPlotDist(gumbel_r,df.discharge.diff().to_numpy()[p],np.linspace(0,3000,1000))
fitPlotDist(pareto,df.discharge.diff().to_numpy()[p],np.linspace(0,3000,1000))
plt.ylim(0,0.02)

plt.xlim(0,1000)

plt.legend()
/home/dvtulf/.local/lib/python3.7/site-packages/seaborn/distributions.py:2557: FutureWarning:

`distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).

/opt/conda/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:2494: RuntimeWarning:

invalid value encountered in double_scalars

Out[20]:
<matplotlib.legend.Legend at 0x7f3e5e679780>

We can see that the Gumbel distribution fits the tail of the data well. As the Wikipedia article states:

This distribution might be used to represent the distribution of the maximum level of a river in a particular year if there was a list of maximum values for the past ten years.

Although the article refers to water levels (the gage height feature), this holds for the discharge as well, because of the strong positive correlation between the two features.
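The extreme-value argument can be illustrated with a synthetic check: block maxima of samples from an exponential-tailed parent should follow a Gumbel distribution (simulated data, not the discharge series):

```python
import numpy as np
from scipy.stats import gumbel_r

rng = np.random.default_rng(1)

# 10,000 "years", each the maximum of 365 draws from an Exp(1) parent
maxima = rng.standard_exponential(size=(10_000, 365)).max(axis=1)

loc, scale = gumbel_r.fit(maxima)
# Theory: the maximum of n Exp(1) draws is approximately Gumbel(log n, 1)
print(abs(loc - np.log(365)) < 0.1, abs(scale - 1.0) < 0.1)  # → True True
```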

In [21]:
from scipy.stats import pearsonr
pearsonr(df[df.isna().sum(axis=1) < 1].discharge, df[df.isna().sum(axis=1) < 1].gage)
Out[21]:
(0.9266836620002035, 0.0)

Exercise 6

Once rainy events are detected, plot the distribution of the length of sunny intervals between rains. Fit the distribution with an appropriate function.

In [22]:
t_rain = pd.Series(df.datetime.to_numpy()[p]).diff()
t_rain_hours = t_rain.map(lambda x: (x.days * 24* 3600 + x.seconds)/3600)

sns.distplot(t_rain_hours, bins = 100, label="raw data with KDE")

fitPlotDist(pareto, t_rain_hours[ t_rain_hours.notna()], np.linspace(0,2000,1000))
fitPlotDist(expon, t_rain_hours[ t_rain_hours.notna()], np.linspace(0,2000,1000))
fitPlotDist(gompertz, t_rain_hours[ t_rain_hours.notna()], np.linspace(0,2000,1000), alpha = 0.5)
plt.ylim(0,0.0065)
plt.legend()
plt.xlim(0,2000)
plt.xlabel("sunny hours between rains [h]")
/home/dvtulf/.local/lib/python3.7/site-packages/seaborn/distributions.py:2557: FutureWarning:

`distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).

/opt/conda/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:2494: RuntimeWarning:

invalid value encountered in double_scalars

Out[22]:
Text(0.5, 0, 'sunny hours between rains [h]')

We can see on the graph that the Gompertz and the exponential distributions fit the tail of the distribution fairly well, but the first part is fit much better by the Pareto distribution.
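The exponential candidate corresponds to the idealized case in which rainy events arrive independently at a constant rate (a Poisson process); a synthetic sketch of that case:

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(2)

# Events scattered uniformly at random in time: a homogeneous Poisson process
events = np.sort(rng.uniform(0, 10_000, size=5_000))
gaps = np.diff(events)  # waiting times between consecutive events

# Memoryless arrivals give exponential waiting times with mean = average spacing
loc, scale = expon.fit(gaps, floc=0)
print(abs(scale - 2.0) < 0.2)  # → True: mean gap ≈ 10000 / 5000
```

That the Pareto fits the short gaps better suggests the real rainy events do not arrive at a constant rate, but cluster in time.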

Exercise 7

What is the maximum of water discharge in an arbitrarily chosen period of one year? Calculate the maximum of water discharge due to rain in a rolling window of 1 year, plot its distribution and fit with an appropriate function.

In [23]:
df3 = deepcopy(df)
df3 = df3.set_index("datetime") #modifying index to the dates, so that the rolling window works
In [24]:
sns.displot(df3.rolling("365D").max().discharge, bins = 30, kde=False, stat="density" )
fitPlotDist(gumbel_r,df3.rolling("365D").max().discharge, np.linspace(0,4*10**4, 1000))
fitPlotDist(lognorm,df3.rolling("365D").max().discharge, np.linspace(0,4*10**4, 1000))
plt.ylabel("density")
plt.legend()
Out[24]:
<matplotlib.legend.Legend at 0x7f3ec05869b0>

This distribution can also be fitted with the Gumbel distribution: it is similarly an extreme value distribution, as described in Exercise 5.

Exercise 8

How many times does it rain in a month? (Again, use a rolling window function.) Calculate and plot the distribution and fit it with an appropriate function.

In [25]:
from scipy.optimize import curve_fit
max_rain = 20

rainy = set(p) #A set makes the membership test fast
df3["isRainy"] = [ 1 if i in rainy else 0 for i in range(len(df3))] #Mark rainy time stamps with 1, others with 0
hist = plt.hist(df3.rolling("30D").isRainy.sum(),bins = np.arange(max_rain)-0.5,
            label="raw data", density = True)

#Computing the midpoints of the histogram bins
bin_mids = [ (hist[1][i] + hist[1][i+1]) *0.5 for i in range( len(hist[1]) -1 )]

#Fitting the Poisson distribution
def fit_func(k, lambd):
    return poisson.pmf(k, lambd)
params, cov_matr = curve_fit(fit_func, bin_mids, hist[0])

#Plotting
plt.vlines(bin_mids, 0, fit_func(bin_mids,params), label=rf"Poisson distribution $\lambda=${np.round(params[0],3)}")
plt.plot(bin_mids, fit_func(bin_mids,params), color="black", marker ="o", ls = "")

plt.ylabel("density")
plt.legend()
plt.grid()

The Poisson distribution expresses the probability of a given number of events occurring in a fixed interval of time or space, if these events occur with a known constant mean rate and independently of the time since the last event. The number of rainy events might differ by season, though.
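The defining property behind this fit (counts in fixed windows of a constant-rate process are Poisson, whose mean equals its variance) can be checked on synthetic events:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic constant-rate events: 5000 events uniform on [0, 10000)
events = rng.uniform(0, 10_000, size=5_000)

# Count events in non-overlapping windows of length 30
edges = np.arange(0, 10_000 + 1, 30)
counts = np.histogram(events, bins=edges)[0]

# For Poisson(lambda), mean and variance are both lambda (= 15 here)
print(abs(counts.mean() - 15) < 1, abs(counts.var() - 15) < 4)
```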

Exercise 9

Find the measuring station you used in the exercises above on the map. Find another measurement station about 100-200 km away and download its data. Try to estimate the typical time it takes for weather fronts to travel the distance between the two measuring stations.

In [26]:
!curl "https://nwis.waterdata.usgs.gov/nwis/uv/?parameterCd=00060,00065&format=rdb&site_no=01510000&period=&begin_date=20100101&end_date=20191231&siteStatus=all" > data2.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.0M    0 16.0M    0     0   405k      0 --:--:--  0:00:40 --:--:-- 3676k27    0     0    130      0 --:--:--  0:00:07 --:--:--   130
In [27]:
!head -n 40 data2.csv
# ---------------------------------- WARNING ----------------------------------------
# Some of the data that you have obtained from this U.S. Geological Survey database
# may not have received Director's approval. Any such data values are qualified
# as provisional and are subject to revision. Provisional data are released on the
# condition that neither the USGS nor the United States Government may be held liable
# for any damages resulting from its use.
#
# Additional info: https://help.waterdata.usgs.gov/policies/provisional-data-statement
#
# File-format description:  https://help.waterdata.usgs.gov/faq/about-tab-delimited-output
# Automated-retrieval info: https://help.waterdata.usgs.gov/faq/automated-retrievals
#
# Contact:   gs-w_support_nwisweb@usgs.gov
# retrieved: 2021-05-24 14:25:29 EDT       (nadww01)
#
# Data for the following 1 site(s) are contained in this file
#    USGS 01510000 OTSELIC RIVER AT CINCINNATUS NY
# -----------------------------------------------------------------------------------
#
# Data provided for site 01510000
#            TS   parameter     Description
#        107450       00060     Discharge, cubic feet per second
#        107451       00065     Gage height, feet
#        229134       00045     Precipitation, total, inches
#
# Data-value qualification codes included in this output:
#        
#     A  Approved for publication -- Processing and review completed.
#     e  Value has been estimated.
# 
agency_cd	site_no	datetime	tz_cd	107450_00060	107450_00060_cd	107451_00065	107451_00065_cd
5s	15s	20d	6s	14n	10s	14n	10s
USGS	01510000	2010-01-01 00:00	EST	356	A	2.09	A
USGS	01510000	2010-01-01 00:15	EST	356	A	2.09	A
USGS	01510000	2010-01-01 00:30	EST	356	A	2.09	A
USGS	01510000	2010-01-01 00:45	EST	356	A	2.09	A
USGS	01510000	2010-01-01 01:00	EST	356	A	2.09	A
USGS	01510000	2010-01-01 01:15	EST	356	A	2.09	A
USGS	01510000	2010-01-01 01:30	EST	352	A	2.08	A
USGS	01510000	2010-01-01 01:45	EST	352	A	2.08	A
In [28]:
df2 = pd.read_csv("data2.csv",sep='\t', comment='#',header=0)
df2 = df2.drop(index=0)
df2.head()
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3166: DtypeWarning:

Columns (1,4,6) have mixed types.Specify dtype option on import or set low_memory=False.

Out[28]:
agency_cd site_no datetime tz_cd 107450_00060 107450_00060_cd 107451_00065 107451_00065_cd
1 USGS 01510000 2010-01-01 00:00 EST 356 A 2.09 A
2 USGS 01510000 2010-01-01 00:15 EST 356 A 2.09 A
3 USGS 01510000 2010-01-01 00:30 EST 356 A 2.09 A
4 USGS 01510000 2010-01-01 00:45 EST 356 A 2.09 A
5 USGS 01510000 2010-01-01 01:00 EST 356 A 2.09 A
In [29]:
print(df2.agency_cd.unique(), df2.site_no.unique(), df2.tz_cd.unique()) 
df2 = df2.drop(columns = ["agency_cd", "site_no","tz_cd"])
df2.head()
['USGS'] ['01510000' 1510000] ['EST']
Out[29]:
datetime 107450_00060 107450_00060_cd 107451_00065 107451_00065_cd
1 2010-01-01 00:00 356 A 2.09 A
2 2010-01-01 00:15 356 A 2.09 A
3 2010-01-01 00:30 356 A 2.09 A
4 2010-01-01 00:45 356 A 2.09 A
5 2010-01-01 01:00 356 A 2.09 A
In [30]:
df2.dtypes
Out[30]:
datetime           object
107450_00060       object
107450_00060_cd    object
107451_00065       object
107451_00065_cd    object
dtype: object
In [31]:
df2 = df2.rename(columns = {"107451_00065": "gage2", "107451_00065_cd": "gage_c2", "107450_00060": "discharge2", "107450_00060_cd":"discharge_c2"})
In [32]:
df2["datetime"] = pd.to_datetime(df2.datetime)
df2["discharge2"] = df2.discharge2.astype(float)
df2["gage2"] = df2.gage2.astype(float)
df2.head(), df2.dtypes
Out[32]:
(             datetime  discharge2 discharge_c2  gage2 gage_c2
 1 2010-01-01 00:00:00       356.0            A   2.09       A
 2 2010-01-01 00:15:00       356.0            A   2.09       A
 3 2010-01-01 00:30:00       356.0            A   2.09       A
 4 2010-01-01 00:45:00       356.0            A   2.09       A
 5 2010-01-01 01:00:00       356.0            A   2.09       A,
 datetime        datetime64[ns]
 discharge2             float64
 discharge_c2            object
 gage2                  float64
 gage_c2                 object
 dtype: object)
In [33]:
print(df2.isna().sum()/len(df2))
#Let's get rid of the missing data

df2 = df2[df2.discharge2.notna()]
print("\n",df2.isna().sum()/len(df2))
datetime        0.000000
discharge2      0.075577
discharge_c2    0.075577
gage2           0.000034
gage_c2         0.000034
dtype: float64

 datetime        0.000000
discharge2      0.000000
discharge_c2    0.000000
gage2           0.000037
gage_c2         0.000037
dtype: float64
In [34]:
plt.figure(figsize=[15,5])

plt.plot(df2.datetime,df2.discharge2, color="tab:orange", label="Discharge")

p2 = argrelextrema(df2.discharge2.to_numpy(), np.greater, order = 650)[0]
plt.plot(df2.datetime.to_numpy()[p2],df2.discharge2.to_numpy()[p2], "o", label="possible rain",color="tab:green")
plt.ylabel("Discharge [$ft^3$ /s ]",size=13)

plt.legend()
plt.title("The second measuring station's discharge values",size=16)
Out[34]:
Text(0.5, 1.0, "The second measuring station's discharge values")
/opt/conda/lib/python3.7/site-packages/IPython/core/pylabtools.py:132: UserWarning:

Creating legend with loc="best" can be slow with large amounts of data.

In [35]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=df2.datetime, 
                         y=df2.discharge2,
                         name = "Discharge",
                         line={ "color" : "#FF7F0E", "width": 2}))

p2,_ = find_peaks(df2.discharge2,threshold=None, height = 300, prominence = 300)

#Plot only the points, where peaks were found
fig.add_trace(go.Scatter(x = pd.to_datetime(df2.datetime.to_numpy()[p2]),
                         y = df2.discharge2.to_numpy()[p2],
                         mode='markers', 
                         name='Possible rain',
                         marker = { "color" : "#2CA02C"}))

fig.update_layout(yaxis_title="Discharge [ft^3 /s]")
fig.update_layout(title="The second measuring station's discharge values")
#fig.update_xaxes(range=[pd.to_datetime("2014-01-01 00:00:00"), pd.to_datetime("2015-12-31 23:45:00")])
#This was my test interval
fig.show()
len(p2)
Out[35]:
354

Cross-correlation can identify the similarity of two signals. The degree of similarity can be assessed qualitatively by looking at the result, the correlogram. Based on how symmetric the correlogram is, we can decide whether the two signals are similar, and we can measure the lag of these similar parts quantitatively.

I used this method to determine the lag of rainy events between the two measuring stations.
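As a sanity check of the method on synthetic data where the true delay is known (a sketch, not the station data):

```python
import numpy as np
from scipy.signal import correlate

rng = np.random.default_rng(4)

n, true_lag = 2_000, 11
# Smoothed random noise stands in for the upstream discharge signal
upstream = np.convolve(rng.standard_normal(n), np.ones(20), mode="same")
downstream = np.roll(upstream, true_lag)  # delayed copy of the same signal

# Mean-removed cross-correlation; for mode="full" with equal-length inputs
# the lag axis runs from -(n-1) to n-1
cc = correlate(downstream - downstream.mean(), upstream - upstream.mean(), mode="full")
lags = np.arange(-(n - 1), n)
print(lags[np.argmax(cc)])  # → 11, the injected delay
```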

In [36]:
len(df), len(df2) #We can see that the two dataframes contain different amounts of data.
#This can be attributed to missing data
Out[36]:
(295944, 323499)
In [37]:
#Joining the tables on the common dates
df_merged = pd.merge(df, df2, how='inner', left_on='datetime', right_on='datetime')
df_merged.head(), df_merged.shape
Out[37]:
(             datetime  gage gage_c  discharge discharge_c  discharge2  \
 0 2010-01-01 00:00:00  2.94      A     1030.0           A       356.0   
 1 2010-01-01 00:15:00  2.94      A     1030.0           A       356.0   
 2 2010-01-01 00:30:00  2.94      A     1030.0           A       356.0   
 3 2010-01-01 00:45:00  2.94      A     1030.0           A       356.0   
 4 2010-01-01 01:00:00  2.94      A     1030.0           A       356.0   
 
   discharge_c2  gage2 gage_c2  
 0            A   2.09       A  
 1            A   2.09       A  
 2            A   2.09       A  
 3            A   2.09       A  
 4            A   2.09       A  ,
 (291192, 9))
In [44]:
#Using scipy's cross-correlation
from scipy.signal import correlate
sc_cc = correlate(df_merged.discharge/max(df_merged.discharge), df_merged.discharge2/max(df_merged.discharge2),
                   mode="full")

fig = go.Figure()

fig.add_trace(go.Scatter(x=np.arange(-len(sc_cc)//2, len(sc_cc)//2), 
                         y=sc_cc,
                         name = "Discharge",
                         line={ "color" : "#FF7F0E", "width": 2}))

print(np.argmax(sc_cc))
fig.update_layout(yaxis_title="Cross-correlation",
                  xaxis_title="Lag")

fig.show()
291201
In [39]:
df_merged.datetime.diff().mean() #This is the mean of sampling times
Out[39]:
Timedelta('0 days 00:18:03.590839002')
In [40]:
np.argmax(sc_cc) + 1 - len(sc_cc)//2 
#This is approximately how many sampling intervals (~18 min each) the lag is: 18 min x 11 = 198 min = 3.3 h
#The reason for the +1 is that indexing starts at 0, while the length of the array counts from 1.
#So if the array had only one element, the expression would give -1 back without the +1.
Out[40]:
11

By using https://www.movable-type.co.uk/scripts/latlong.html I calculated the distance between the two measuring stations, which came out to be about 160 km. In the previous cell, the estimated lag was 3.3 hours. This means that the weather fronts travel at around 48 km/h, if they go straight towards the second measuring station - which is hardly the case.

According to https://www.livescience.com/39004-weather-fronts-definition-facts.html :

Cold fronts generally advance at average speeds of 20 to 25 mph. (edit - 32-40 km/h)

and

Warm fronts are seldom as well marked as cold fronts, and they usually move about half as fast, at about 10 to 15 mph, and sometimes even slower. (edit - 16-24 km/h)

So the results, like the methods, are not very sophisticated, but they are in a meaningful order of magnitude.

The weather front usually reaches the first measuring station, then the second one.
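The distance step can also be scripted with the haversine formula. The coordinates below are hypothetical placeholders chosen to be roughly 160 km apart; they are NOT the actual station locations:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    r = 6371.0  # mean Earth radius [km]
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Placeholder coordinates (not the real stations), ~160 km apart
dist_km = haversine_km(43.0, -74.0, 42.5, -75.8)
speed_kmh = dist_km / 3.3  # lag estimated from the cross-correlation above
print(round(dist_km), round(speed_kmh))
```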

In [41]:
# Using pandas' Pearson correlation method, which was much slower,
# but made it easier to set the bounds of the shift
cc = [ df_merged.discharge.corr(df_merged.discharge2.shift( s ))
      for s 
      in range(-len(df_merged)//100, len(df_merged)//100)]
In [42]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=[*range(-len(df_merged)//100, len(df_merged)//100)], 
                         y=cc,
                         name = "Discharge",
                         line={ "color" : "#FF7F0E", "width": 2}))
fig.update_layout(yaxis_title="Cross-correlation",
                  xaxis_title="Lag")
fig.show()
In [43]:
#With this other implementation, the result came out to be 12 x 18 mins = 216 mins = 3.6 hours
np.argmax(cc) + 1 - len(df_merged)//100
#The reason for the +1 is described above
Out[43]:
12

With this method the result is 3.6 hours, which means that the weather front travels at about 44.4 km/h.